In [27]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
sns.set_style('white')
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Remember that you can experiment with visualization tools using not only real data but also synthetic data. Synthetic data is a good way to experiment because you control every aspect of it. Let's create some random numbers.
The function np.random.randn(N) generates a sample of size $N$ from the standard normal distribution.
In [28]:
print(np.random.randn(10))
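np.random.randn() and np.random.rand() differ by one letter but draw from different distributions. The sketch below compares the two on a large sample; the fixed seed and the sample size are illustrative choices, not part of the notebook.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable (illustrative choice)

normal_sample = np.random.randn(100_000)   # standard normal: mean 0, std 1
uniform_sample = np.random.rand(100_000)   # uniform on [0, 1)

# The normal sample is centered at 0 and takes negative values;
# the uniform sample never leaves [0, 1).
print(normal_sample.mean(), normal_sample.std())
print(uniform_sample.min(), uniform_sample.max())
```

If the printed minimum and maximum stay between 0 and 1, the numbers came from rand(), not randn().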
The following small function generates $N$ normally distributed numbers:
In [29]:
def generate_many_numbers(N=10, mean=5, sigma=3):
    # shift and scale standard normal draws to the requested mean and sigma
    return mean + sigma * np.random.randn(N)
Generate 10 normally distributed numbers with mean 5 and sigma 3:
In [30]:
data = generate_many_numbers(N=10)
print(data)
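With only 10 numbers, the sample mean and standard deviation can land quite far from 5 and 3. A quick sanity check, sketched here with a much larger sample (the seed and the sample size are assumptions made for illustration), shows that the function behaves as intended:

```python
import numpy as np

def generate_many_numbers(N=10, mean=5, sigma=3):
    # same definition as in the notebook, repeated so this cell runs on its own
    return mean + sigma * np.random.randn(N)

np.random.seed(1)  # illustrative seed
big_sample = generate_many_numbers(N=200_000)

# Both statistics should be close to the requested mean=5 and sigma=3.
print(big_sample.mean(), big_sample.std())
```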
The most direct way to visualize 1-D data is simply to plot it. Here we can use the scatter() function to draw a scatter plot. In its most basic form, the function takes x and y coordinates.
In [31]:
x = np.arange(1,11)
y = x + 5
print(x)
print(y)
plt.scatter(x, y)
Out[31]:
But here we only have x (the generated data). We can set the y values to 0. The np.zeros_like(data) function creates a numpy array of zeros that has the same shape and data type as its argument.
In [32]:
print(np.zeros_like(data))
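To see what "same shape and data type" means in practice, here is a small sketch with two made-up arrays (the array contents are arbitrary):

```python
import numpy as np

int_data = np.array([3, 1, 4])                     # 1-D integer array
float_data = np.array([[0.5, 1.5], [2.5, 3.5]])    # 2-D float array

zeros_int = np.zeros_like(int_data)
zeros_float = np.zeros_like(float_data)

# zeros_like copies shape and dtype from its argument; only the values change.
print(zeros_int.shape, zeros_int.dtype)
print(zeros_float.shape, zeros_float.dtype)
```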
Now let's plot the generated 1-D data.
In [33]:
plt.figure(figsize=(10,1)) # set figure size, width = 10, height = 1
plt.scatter(data, np.zeros_like(data), s=50) # set size of symbols to 50. Change it and see what happens.
plt.gca().axes.get_yaxis().set_visible(False) # set y axis invisible
Ok, I think we can see all data points. But what if we have more numbers?
In [34]:
# TODO: generate 100 numbers and plot them in the same way.
data = generate_many_numbers(N=100)
plt.figure(figsize=(10,1))
plt.scatter(data, np.zeros_like(data), s = 50)
plt.gca().axes.get_yaxis().set_visible(False)
As expected, the points overlap so heavily near the center that we cannot tell them apart. We can add vertical "jitter" using the np.random.rand() function.
In [35]:
data = generate_many_numbers(N=100)
# TODO: create a list of 100 random numbers using np.random.rand()
jittered_ypos = np.random.rand(100)  # random vertical offsets in [0, 1)
plt.figure(figsize=(10,1))
plt.scatter(data, jittered_ypos, s=50)
plt.gca().axes.get_yaxis().set_visible(False)
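Since the same jitter pattern comes up repeatedly, it can be wrapped in a small function. The following is only a sketch: jitter_strip and its band parameter are hypothetical names introduced here, not part of matplotlib or of the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

def jitter_strip(values, band=1.0, ax=None, **scatter_kwargs):
    """Hypothetical helper: 1-D scatter with uniform vertical jitter in [0, band)."""
    values = np.asarray(values)
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 1))
    # Noise goes into the y position only; the x position still carries the data.
    ypos = band * np.random.rand(len(values))
    points = ax.scatter(values, ypos, **scatter_kwargs)
    ax.get_yaxis().set_visible(False)  # the y axis is meaningless here
    return points

np.random.seed(2)  # illustrative seed
sample = 5 + 3 * np.random.randn(100)
points = jitter_strip(sample, s=50)
```

The key design point is that the jitter is purely cosmetic: it spreads the symbols vertically without altering the values being shown.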
Let's also make the symbols transparent. The documentation of scatter() helps here: the alpha parameter controls the opacity of the symbols.
In [36]:
data = generate_many_numbers(N=200)
# TODO: implement this
jittered_ypos = np.random.rand(200)
plt.figure(figsize=(10,1))
plt.scatter(data, jittered_ypos, s=50, alpha=0.35)  # alpha < 1 makes the symbols semi-transparent
plt.gca().axes.get_yaxis().set_visible(False)
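Transparency helps because overlapping symbols add up to a darker color, so density becomes visible. As a rough sketch, assuming standard "over" alpha compositing of identical symbols, k fully overlapping points with opacity alpha produce a combined opacity of 1 - (1 - alpha)^k. The function name below is made up for illustration.

```python
def stacked_opacity(alpha, k):
    """Combined opacity of k identical, fully overlapping symbols (standard alpha blending assumed)."""
    return 1 - (1 - alpha) ** k

# With alpha = 0.35, a handful of overlapping points already looks nearly solid.
for k in (1, 2, 5, 10):
    print(k, round(stacked_opacity(0.35, k), 3))
```

A smaller alpha leaves more headroom for dense regions, which is why very large samples usually call for lower values.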
We can use transparency as well as empty symbols.
In [37]:
# TODO: implement this
data = np.random.rand(1000)
jittered_ypos = np.random.rand(1000)
plt.figure(figsize=(10,1))
# facecolors='none' draws hollow symbols; alpha adds transparency on top
plt.scatter(data, jittered_ypos, s=50, facecolors='none', edgecolors='r', alpha=0.35)
plt.gca().axes.get_yaxis().set_visible(False)
Let's use real data. Load the IMDb dataset that we used before.
In [38]:
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
movie_df.head()
Out[38]:
Try to plot the 'Rating' information using a 1-D scatter plot. Does it work?
In [39]:
# TODO: plot 'rating'
rating = movie_df['Rating'].values
plt.figure(figsize=(10,1))
plt.scatter(rating, np.zeros_like(rating), s = 50)
plt.gca().axes.get_yaxis().set_visible(False)
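One reason the 1-D scatter plot breaks down is that many points share exactly the same value and are drawn on top of each other. The sketch below uses synthetic ratings, not the IMDb data, and assumes values are recorded to one decimal place on a 1 to 10 scale; check movie_df to confirm how the real column looks.

```python
import numpy as np

np.random.seed(3)  # illustrative seed

# Synthetic stand-in for a ratings column: one decimal place, clipped to [1, 10].
fake_ratings = np.round(np.clip(6.5 + 1.2 * np.random.randn(10_000), 1, 10), 1)

n_points = fake_ratings.size
n_distinct = np.unique(fake_ratings).size

# Thousands of points collapse onto a few dozen x positions.
print(n_points, n_distinct)
```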
There are too many data points! Let's try a histogram. pandas supports plotting through matplotlib, so you can visualize dataframes and series directly.
In [40]:
movie_df['Rating'].hist()
Out[40]:
Looks good! Can you increase or decrease the number of bins? The bins parameter is described in the documentation of hist().
In [41]:
# TODO: try different number of bins
movie_df['Rating'].hist(bins = 30)
Out[41]:
In [42]:
movie_df['Rating'].hist(bins = 20)
Out[42]:
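To see numerically what the bins argument changes, np.histogram() returns the same counts that a histogram plot would draw. This sketch uses a synthetic normal sample rather than the ratings; the seed and the bin counts are arbitrary.

```python
import numpy as np

np.random.seed(4)  # illustrative seed
sample = 5 + 3 * np.random.randn(1000)

for bins in (5, 20, 50):
    counts, edges = np.histogram(sample, bins=bins)
    # bins intervals need bins + 1 edges; the counts always sum to the sample size
    print(bins, len(edges), counts.sum(), counts.max())
```

More bins give finer resolution but fewer points per bin, so the tallest bar shrinks and the outline gets noisier.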
Now let's try a boxplot. We can use pandas' plotting functions; the documentation of boxplot describes its usage.
In [43]:
movie_df['Rating'].plot(kind='box', vert=False)
Out[43]:
Or try seaborn's boxplot() function:
In [44]:
sns.boxplot(x=movie_df['Rating'])  # pass the series as x explicitly for a horizontal box
Out[44]:
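It helps to know which numbers a boxplot encodes. The sketch below computes them by hand for a small made-up array, assuming the common 1.5 x IQR whisker rule (matplotlib's default, whis=1.5); points beyond the fences are the ones drawn as individual outliers.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # made-up data with one extreme value

q1, median, q3 = np.percentile(values, [25, 50, 75])  # box edges and center line
iqr = q3 - q1                                          # height of the box

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)   # 3.25 5.5 7.75
print(outliers)         # [100]
```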
We can also easily draw a series of boxplots grouped by category. For example, let's draw boxplots of movie ratings for different decades.
In [45]:
df = movie_df.sort_values('Year')  # DataFrame.sort() was removed from pandas; sort_values() replaces it
df.head()
Out[45]:
One easy way to convert a year to its decade (e.g., 1874 -> 1870) is to integer-divide it by 10 and then multiply the result by 10 again. In Python 3, the // operator performs integer (floor) division.
In [46]:
print(1874//10)
print(1874//10*10)
decade = (df['Year']//10) * 10
decade.head()
Out[46]:
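The same arithmetic works element-wise on a whole pandas Series, which is what the decade computation relies on. A minimal sketch with a few made-up years:

```python
import pandas as pd

years = pd.Series([1874, 1999, 2000, 2013])  # made-up years

# Floor division drops the last digit; multiplying by 10 restores the scale.
decades = (years // 10) * 10

print(decades.tolist())  # [1870, 1990, 2000, 2010]
```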
In [47]:
ax = sns.boxplot(x=decade, y=df['Rating'])
ax.figure.set_size_inches(12, 8)
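A grouped boxplot is essentially a per-group summary drawn side by side, and the same grouping can be checked numerically with groupby(). This is a sketch on a tiny made-up table; the column names mirror the notebook's 'Year' and 'Rating', but the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year":   [1991, 1995, 2003, 2008, 2009],
    "Rating": [6.0, 8.0, 5.0, 6.0, 9.0],
})  # invented values, for illustration only

toy_decade = (toy["Year"] // 10) * 10

# One row per decade: how many movies, and the median that the box's center line would show.
summary = toy.groupby(toy_decade)["Rating"].agg(["count", "median"])
print(summary)
```

Checking the counts this way is useful because a box built from a handful of movies looks just as solid as one built from thousands.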
Can you draw boxplots of movie votes for different decades?
In [48]:
# TODO
ax = sns.boxplot(x=decade, y=df['Votes'])
ax.figure.set_size_inches(12, 8)
What do you see? Can you actually see the "box"? The number of votes spans a very wide range, from 1 to more than 1.4 million. One way to deal with this is to log-transform the votes, which can be done with the numpy.log() function.
In [49]:
log_votes = np.log(df['Votes'])
log_votes.head()
Out[49]:
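The effect of the transformation is easy to see on three made-up vote counts that cover the range mentioned above. The sketch also shows np.log10(), which is sometimes easier to read, and np.log1p(), an option if a column could contain zeros (np.log(0) is -inf); neither is used in the notebook itself.

```python
import numpy as np

votes = np.array([1, 1_000, 1_400_000])  # made-up counts spanning the stated range

natural_log = np.log(votes)    # natural logarithm, as in the notebook
base10_log = np.log10(votes)   # each unit is a factor of 10

# Six orders of magnitude shrink to a range of roughly 0 to 14 (or 0 to 6 in base 10).
print(natural_log)
print(base10_log)

# log1p(x) computes log(1 + x), so a zero count maps to 0 instead of -inf.
zero_safe = np.log1p(np.array([0, 9]))
print(zero_safe)
```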
Can you draw boxplots of log-transformed movie votes for different decades?
In [50]:
# TODO
ax = sns.boxplot(x=decade, y = log_votes)
ax.figure.set_size_inches(12, 8)